The Annals of Statistics 
2012, Vol. 40, No. 2, 785-811 
DOI: 10.1214/12-AOS978 

@ Institute of Mathematical Statistics, 2012 



PERTURBATION AND SCALED COOK'S DISTANCE 

By Hongtu Zhu\ Joseph G. Ibrahim^ and Hyunsoon Cho 

University of North Carolina at Chapel Hill 

Cook's distance [Technometrics 19 (1977) 15-18] is one of the 
most important diagnostic tools for detecting influential individual 
or subsets of observations in linear regression for cross-sectional data. 
However, for many complex data structures (e.g., longitudinal data), 
no rigorous approach has been developed to address a fundamental 
issue: deleting subsets with different numbers of observations intro- 
duces different degrees of perturbation to the current model fitted to 
the data, and the magnitude of Cook's distance is associated with 
the degree of the perturbation. The aim of this paper is to address 
this issue in general parametric models with complex data struc- 
tures. We propose a new quantity for measuring the degree of the 
perturbation introduced by deleting a subset. We use stochastic or- 
dering to quantify the stochastic relationship between the degree of 
the perturbation and the magnitude of Cook's distance. We develop 
several scaled Cook's distances to resolve the comparison of Cook's 
distance for different subset deletions. Theoretical and numerical ex- 
amples are examined to highlight the broad spectrum of applications 
of these scaled Cook's distances in a formal influence analysis. 

1. Introduction. Influence analysis assesses whether a modification of 
a statistical analysis, called a perturbation (see Section 2.2), seriously af- 
fects specific key inferences, such as parameter estimates. Such perturbation 
schemes include the deletion of an individual or a subset of observations, 
case weight perturbation and covariate perturbation, among many others 
[8, 9, 28]. For example, for linear models, a perturbation measures the effect 
on the model of deleting a subset of the data matrix. In general, perturba- 
tion measures do not depend on the data directly, but rather on its structure 
via the model. If a small perturbation has a small effect on the analysis, our 
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analysis is relatively stable, while if a large perturbation has a small effect 
on the analysis, we learn that our analysis is robust [11, 16]. If a small per- 
turbation seriously influences key results of the analysis, we want to know 
the cause [9, 11]. For instance, in influence analysis, a set of observations is 
flagged as "influential" if its removal from the dataset produces a significant 
difference in the parameter estimates or, equivalently, a large value of Cook's 
distance for the current statistical model [5, 8]. 

Since the seminal work of Cook [8] on Cook's distance in linear regression 
for cross-sectional data, considerable research has been devoted to devel- 
oping Cook's distance for detecting influential observations (or clusters) in 
more complex data structures under various statistical models [1, 6, 8, 10, 12, 
14, 15, 23, 29]. For example, for longitudinal data, Preisser and Qaqish [19] 
developed Cook's distance for generalized estimating equations, while Chris- 
tensen, Pearson and Johnson [7], Banerjee and Frees [4] and Banerjee [3] 
considered case deletion and subject deletion diagnostics for linear mixed 
models. Furthermore, in the presence of missing data, Zhu et al. [29] devel- 
oped deletion diagnostics for a large class of statistical models with missing 
data. Cook's distance has been widely used in statistical practice and can 
be calculated in popular statistical software, such as SAS and R. 

A major research problem regarding Cook's distance that has been largely 
neglected in the existing literature is the development of Cook's distance for 
general statistical models with more complex data structures. The funda- 
mental issue that arises here is that the magnitude of Cook's distance is 
positively associated with the amount of perturbation to the current model 
introduced by deleting a subset of observations. Specifically, a large value of 
Cook's distance can be caused by deleting a subset with a larger number of 
observations and/or other causes such as the presence of influential observa- 
tions in the deleted subset. To delineate the cause of a large Cook's distance 
for a speciflc subset, it is more useful to compute Cook's distance relative 
to the degree of the perturbation introduced by deleting the subset [11, 28]. 

The aim of this paper is to address this fundamental issue of Cook's 
distance for complex data structures in general parametric models. The main 
contributions of this paper are summarized as follows: 

(a.l) We propose a quantity to measure the degree of perturbation in- 
troduced by deleting a subset in general parametric models. This quantity 
satisfies several attractive properties including uniqueness, nonnegativity, 
monotonicity and additivity. 

(a. 2) We use stochastic ordering to quantify the relationship between the 
degree of the perturbation and the magnitude of Cook's distance. Particu- 
larly, in linear regression for cross-sectional data, we first show the stochastic 
relationship between the Cook's distances for any two subsets with possibly 
different numbers of observations. 
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(a. 3) We develop several scaled Cook's distances and their first-order ap- 
proximations in order to compare Cook's distance for deleted subsets with 
different numbers of observations. 

The rest of the paper is organized as follows. In Section 2, we quantify 
the degree of the perturbation for set deletion and delineate the stochastic 
relationship between Cook's distance and the degree of perturbation. We 
develop several scaled Cook's distances and derive their first-order approx- 
imations. In Section 3, we analyze simulated data and a real dataset using 
the scaled Cook's distances. We give some final remarks in Section 4. 

2. Scaled Cook's distance. 

2.1. Cook's distance. Consider the probability function of a random vec- 
tor Y'^ = {Y^, . . . , yj), denoted by p{Y\e), where 6 = {61,..., Oqf is a (7 x 1 
vector in an open subset of and Yi = . . . ,yi,mi), in which the di- 
mension of Yi, denoted by rrii, may vary significantly across all i. Cook's 
distance and many other deletion diagnostics measure the distance between 
the maximum likelihood estimators of with and without Yi [8, 10]. A sub- 
script "[I]" denotes the relevant quantity with all observations in / deleted. 
Let Y[j] be a subsample of Y, with Y/ = {^(ij) : S 1} deleted, and 
p(Y[j]|0) be its probability function. We define the maximum likelihood 
estimators of for the full sample Y and a subsample Y[/] as 

(2.1) = argmaxlogp(Y|0) and ^j/j = argmaxlogp(Y[/] 



respectively. Cook's distance for /, denoted by CD(/), can be defined as 
follows: 

(2.2) CB{i) = {e[j]-6fGn9{e[i]-e), 

where Gne is chosen to be a positive definite matrix. The matrix One is not 
changed or re-estimated when a subset of the data is deleted. Throughout 
the paper, Gno is set as — 9|logp(Y|0) or its expectation, where 9| repre- 
sents the second-order derivative with respect to 9. For clustered data, the 
observations within the same cluster are correlated. A sensible model p(Y\6) 
should explicitly model the correlation structure in the clustered data and 
thus — 9|logp(Y|0) implicitly incorporates such a correlation structure. 

More generally, suppose that one is interested in a subset of 6 or qi linearly 
independent combinations of 6, say Ij^O, where L is a (7 x gi matrix with 
rank(L) = qi [4, 10]. The partial influence of the subset / on L^0, denoted 
by CD(/|L), can be defined as 

(2.3) CD(/|L) = (0[,] - 0)^L{L^G;JL}-1L^(0[,] - 6). 

For notational simplicity, even though we may focus on a subset of 0, we do 
not distinguish between CD(/|L) and CD(/) throughout the paper. 
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Based on (2.2), we know that Cook's distance CD(/) is explicitly deter- 
mined by three components, including the current model fitted to the data, 
denoted by Ai, the dataset Y and the subset /, itself. Cook's distance is also 
implicitly determined by the goodness of fit of to Y for I, denoted by 
G{I\Y, and the degree of the perturbation to A4 introduced by deleting 
the subset I, denoted by T'{I\J^). Thus, we may represent CD(/) as follows: 

(2.4) CBil) = Fi{I,M,Y) = F2{r{I\M),G{I\Y,M)), 

where Fi (■,■,■) and F2{-,-) represent nonlinear functions. 

We may use the value of CD(/) to assess the influential level of the sub- 
set /. We may regard a subset I as influential if either the value of CD(/) 
is relatively large, compared with other Cook's distances, or the magnitude 
of CD(/) is greater than the critical points of the distribution [10]. How- 
ever, for complex data structures, we will show that it is useful to compare 
Cook's distance relative to its associated degree of perturbation. 

2.2. Degree of perturbation. Consider the subset / and the current mod- 
el Ai. We are interested in developing a measure to quantify the degree of 
the perturbation to Ai introduced by deleting the subset /, regardless of 
the observed data Y. We emphasize here that the degree of perturbation is 
a property of the model, unlike Cook's distance which is also a property of Y. 
Abstractly, V{I\Ai) should be defined as a mapping from a subset / and Ai 
to a nonnegative number. However, according to the best of our knowledge, 
no such quantities have ever been developed to define a workable V{I\Ai) 
for an arbitrary subset / in general parametric models, due to many con- 
ceptual difficulties [11]. Specifically, even though [11] placed the Euclidean 
geometry on the perturbation space for one-sample problems, such a geo- 
metrical structure cannot be easily generalizable to general data structures 
(e.g., correlated data) and related parametric models. For instance, for cor- 
related data, a sensible model Ai should model the correlation structure, 
and a good measure V^IlM) should explicitly incorporate the correlation 
structure specified in Ai and the subset I. However, the Euclidean geom- 
etry proposed by [11] cannot incorporate the correlation structure in the 
correlated data. 

Our choice of 'P(/|A^) is motivated by five principles, as follows: 

• (P.a) (nonnegativity) For any subset /, V{I\Ai) is always nonnegative. 

• (P.b) (uniqueness) V{I\Ai) =0 if and only if / is an empty set. 

• (P.c) (monotonicity) If hch, then VihlM) <V{Ii\M). 

• (P.d) (additivity) If hCh, h.2 = h-h and p{Y i^,^\Y^i^^,e) = p{Y 
Y[j^,^^,0) for all 6>, then we have r{Ii\M) = VihlM) +V{Ii.2\M). 

• (P.e) V{I\Ai) should naturally arise from the current model Ai, the 
data Y and the subset /. 
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Principles (P.a) and (P.b) indicate that deleting any nonempty subset al- 
ways introduces a positive degree of perturbation. Principle (P.c) indicates 
that deleting a larger subset always introduces a larger degree of pertur- 
bation. Principle (P.d) presents the condition for ensuring the additivity 
property of the perturbation. Since Y[/^ is the union of Y[/^] and Y/j, 
p(Y/^ 2|Y[7j],0) = p(Y/j.2|Y[/^.2], 0) is equivalent to that of Y/^ 2 being in- 
dependent of Y/2 given Yj/^j. The additivity property has important im- 
plications in cross-sectional, longitudinal and family data. For instance, in 
longitudinal data, the degree of perturbation introduced by simultaneously 
deleting two independent clusters equals the sum of their degrees of individ- 
ual cluster perturbations. 

Principle (P.e) requires that V{I\J^) depend on the triple (A^,Y,/). 
We propose ^(/ITW) based on the Kullback-Leibler divergence between the 
fitted probability function p(Y\6) and the probability function of a model for 
characterizing the deletion of Y/, denoted by p(Y\6, 1). Note that p(Y\6) = 
p{Y[j]\6)p(Yj\Y[jj,9), where p(Y/|Y[/],0) is the conditional density of Y/ 
given Y[/]. Let 0^, be the true value of under A4 [24, 25]. We define 
p{Y\6,I) as follows: 

(2.5) p{Y\e,I) =p(Y[,]|0)p(Y,|Y[,],0,), 

in which p(Y/| Y[/] , 0*) is independent of 6. In (2.5), by fixing 6 = 6^, in 
p(Y/|Y[/], 0), we essentially drop the information contained in Y/ as we 

estimate 9. Specifically, is the maximum likelihood estimate of 6 under 
p{Y\6,I). If M is correctly specified, then p(Y/|Y[/], 0*) is the true data 
generator for Yj given Yj/j. The Kullback-Leibler distance between p{Y\6) 
and p{Y\6,I), denoted by KL(Y, /), is given by 

We use KL{Y ,6\0^,, I) to measure the effect of deleting Yj on estimating 6 
without knowing that the true value of is 0*. If Yj is independent of Y[/], 
then we have 

KL(Y,6\e,,I) = Jp(Yj\e)log(j^^yYj, 

which is independent of Yj/] . In this case, the effect of deleting Y/ on esti- 
mating 6 only depends on {p{Yj\6) -.6 £ &}. 

A conceptual difficulty associated with KL(Y,0|0*,/) is that both 6 
and 0* are unknown. Although 0* is unknown, it can be assumed to be 
a fixed value from a frequentist viewpoint. For the unknown 6, we can al- 
ways use the data Y and the current model M to calculate an estimator 6 
in a neighborhood of 0*. Under some mild conditions [24, 25], one can show 
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that \/n{Q — 0*) is asymptotically normal, and thus should be centered 
around 0*. Moreover, since Cook's distance is to quantify the change of the 
parameter estimates after deleting a subset, we need to consider all possi- 
ble around 0*, instead of focusing on a single 0. Specifically, we consider Q 
in a neighborhood of 0^, by assuming a Gaussian prior for with mean 
and positive definite covariance matrix (e.g., the Fisher information ma- 
trix), denoted by p(0|0*,S*). Finally, we define V(J\M) as the weighted 
Kullback-Leibler distance between p(Y|0) and p(Y|0,/) as follows: 

(2.7) v{i\M) = j KL{Y,0\e „i)p{e\e„i:^)d9. 

This quantity V{I\Ai) can also be interpreted as the average effect of delet- 
ing Y/ on estimating with the prior information that the estimate of 6 is 
centered around 0*- Since V{I\M) is directly calculated from the model M 
and the subset /, it can naturally account for any structure specified in Ai. 
Furthermore, if we are interested in a particular set of components of 6 and 
treat others as nuisance parameters, we may fix these nuisance parameters 
at their true value. 

To compute P(I|A1) in a real data analysis, we only need to specify Ai 
and (0*, S*). Then we may use some numerical integration methods to com- 
pute V{I\Ai). Although (0,,,S,,) are unknown, we suggest substituting 0^ 
by an estimator of 0, denoted by 6, and by the covariance matrix of 0. 
Throughout the paper, since is a consistent estimator of 0* [24, 25], we 
set = 6 and S* as the covariance matrix of 6. 

We obtain the following theorems, whose detailed assumptions and proofs 
can be found in the Appendix. 

Theorem 1. Suppose that L{{Y ■.p{Yj\Y[j],e) = p{Yj\Y[j],e^)}) > 
for any 0^0^, where L{A) is the Lehesgue measure of a set A. Then, 
V{I\M) defined in (2.7) satisfies the four principles (P.a)-(P.d). 

As an illustration, we show how to calculate V{I\Ai) under the standard 
linear regression model for cross-sectional data as follows. 

Example 1. Consider the linear regression model Ui = x^'/S^ where Xj 
is a p X 1 vector, and the Ei are independently and identically distributed 
(i.i.d.) as A^(0, cr^). Let y = {yi, . . . , y-n)'^ and X be an n x p matrix of rank p 
with ith row xf . In this case, 6 = (/3^, cr^)^. Recall that ^ = (X^X)~^X^y, 

= y^(I„ - H,)y/n, Cov(^) = ^^(X^X)-! and var(a2) = 2at/n, where I„ 
is an n X n identity matrix and Hx = (hij) = X(X-^X)~^X-^. We first com- 
pute the degree of the perturbation for deleting each {yi,Xi). We consider 
two scenarios: fixed and random covariates. For the case of fixed covari- 
ates, Ai assumes yi ~ A^(x^/3, o"^). After some algebraic calculations, it can 
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be shown that V{{i}\M) equals 

^^Ee[{(3 - /3J(/3-/3,f]x,. ^ ^ 1 
0-2 ^ 2n 2 



(2.8) 0.5mogial/a')] + 0.5 "- "^^^^ " ^1^,^^ " « ^ + 



where Eg is taken with respect to p{0\6^:,G~q). Moreover, the right-hand 
side of (2.8) contains only terms involving n and X, since perturbation 
is defined only in terms of the underlying model Ad. This is also at the 
core of why only stochastic ordering is possible for Cook's distance, which 
is a function of both the perturbation and the data. See Section 2.3 for 
detailed discussions. Furthermore, if (3 is the parameter of interest in 6 
and cr^ is a nuisance parameter, then 0.5EQ[log{(j'^ / a"^)], and l/(2n) can be 
dropped from P({i}|jW) in (2.8). 

Furthermore, for the case of random covariates, we assume that the Xj's 
are independently and identically distributed with mean fi^ and covariance 
matrix T^x- It can be shown that V{{i}\J^) equals 

(2.9) 0.5Ee[log{al/a^)] + O.ba^hriJ^xEeiiP - f3,){f3 - /3J^]} ^^ + ^. 

In In 

If (3 is the parameter of interest in 6, and a"^ is a nuisance parameter, then 
V{{i}\M.) reduces to p/{2n). Furthermore, consider deleting a subset of ob- 
servations { (yij. , Xjj. ) : /c = 1, . . . , n{I)} and / = {ii, . . . , in(i)}- It follows from 

Theorem 1 that V{{ii, . . . ,in[i)]\M) = Y£i'^i{^k]\M). Furthermore, for 
the case of random covariates, we have V{I\Ai) = n(/)'P({l}|A4) for any 
subset / with n{I) observations. Thus, in this case, deleting any two sub- 
sets Ii and I2 with the same number of observations, that is, n{Ii) = n[l2), 
has the same degree of perturbation. An important implication of these cal- 
culations in real data analysis is that we can directly compare CD(/i) and 
CD(/2) when n{h) = n{l2). 

2.3. Cook's distance and degree of perturbation. To understand the rela- 
tionship between V{I\M) and CD(/) in (2.4), we temporarily assume that 
the fitted model Ai is the true data generator of Y. To have a better un- 
derstanding of Cook's distance, we consider the standard linear regression 
model for cross-sectional data as follows. 

Example 1 (Continued). We are interested in (3 and treat cr^ as a nui- 
sance parameter. We first consider deleting individual observations in linear 
regression. Cook's distance [8] for case (yj,Xj), is given by 

(^-^m)^X^X(^-^m) ^2 , ha 

(2.10) CD({z})= ^ ^ "^'^^^ =^^2 



a 



2*1-/1 



where (3 is the least squares estimate of (3, a"^ is a consistent estimator 
of 0-2, ti = ei/{ay/l - hn) and ^[jj = ^ - (X^X)~^Xjej/(l - ha), in which 
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Gi = Vi — ^fP- It should be noted that except for a constant p, CD({i}) 
is almost the same as the original Cook's distance (Cook [8]). As shown 
in (2.8) and (2.9), regardless of the exact value of (yj,Xj), deleting any (yi,Xj) 
has approximately the same degree of perturbation to M. Moreover, the 
CD{{i}) are comparable regardless of i. Specifically, if Si ~ A^(0,cj^), then 
follows the x^(l) distribution for all i. For the case of random covariates, 
if Xj are identically distributed, then all CD({i}) are truly comparable, since 
they follow the same distribution. 

We consider deleting multiple observations in the linear model. Cook's 
distance for deleting the subset / with n(I) is given by 

(^- An)^X^X(/3-^rn) 1 „ 1 1 

(2.11) '"^ ^ = ^ej(l„(,) - Hjy'Hj(I,,^j) - Hj)-% 

where ej is an n{I) x 1 vector containing all Cj for i G 7 and Hj = 
X/(X^X)~^Xj, in which X/ is an n{I) x p matrix whose rows are x?" for 
all i G I. Similar to the deletion of a single case, deleting any subset with the 
same number of observations introduces approximately the same degree of 
perturbation to Ai, and the CD (I) are comparable among all subsets with 
the same n(I). We will make this statement precise in Theorem 2 given 
below. 

Generally, we want to compare CD(/i) and CD(/2) for any two subsets 
with n{Ii) 7^ n(/2). As shown in Example 1, when n(Ii) > n(/2), deleting Ii 
introduces a larger degree of perturbation to model M compared to delet- 
ing 12- To compare Cook's distances among arbitrary subsets, we need to 
understand the relationship between V{I\M) and CD(/) for any subset /. 
Surprisingly, in linear regression for cross-sectional data, we can show the 
stochastic relationship between V{I\Ai) and CD(/), as follows. 

Theorem 2. For the standard linear model, where y = X/3 -|- e and e ~ 
N{0,a'^ln), we have the following results: 

(a) For any I2 C Ii, CD(/i) is stochastically larger than CD (12) for any X, 
that is, P(CD(/i) > t\M) > P(CD(/2) > t\M) holds for any t > 0. 

(b) Suppose that the components ofJi-j and'X.j/ are identically distributed 
for any two subsets I and I' with n{I) = n{I'). Thus, CD(/) and CD(/') 
follow the same distribution when n{I) = n{I') and CD(/i) is stochastically 
larger than CD(/2) for any two subsets I2 and Ii with n(/i) > n(/2). 

Theorem 2(a) shows that the Cook's distances for two nested subsets 
satisfy the stochastic ordering property. Theorem 2(b) indicates that for 
random covariates, the Cook's distances for any two subsets also satisfy the 
stochastic ordering property under some mild conditions. 

According to Theorem 2, for more complex data structures and models, 
it may be natural to use the stochastic order to stochastically quantify the 
positive association between the degree of the perturbation and the magni- 
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tude of Cook's distance. Specifically, we consider two possibly overlapping 
subsets h and h with V{Ii\M) >V{l2\M). Although CD(Ii) may not be 
greater than CD(/2) for a fixed dataset Y, CD(/i), as a random variable, 
should be stochastically larger than CD(/2) if -M. is the true model for Y. 
We make the following assumption. 

Assumption Al. For any two subsets h and h with 'P{Ii\M) > 'P{l2\M), 

(2.12) P(CD(/i) > t\M) > P(CD(/2) > t\M) 

holds for any t > 0, where the probability is taken with respect to M. 

Assumption Al is essentially saying that if M is the true data gener- 
ator, then CD(Ii) stochastically dominates CD(l2) whenever 'P{Ii\M) > 
7^(12 1 A^). According to the definition of stochastic ordering [20], we can 
now obtain the following proposition. 

Proposition 1. Under Assumption Al, for any two subsets Ii and I2 
with V{Ii\Ai) >V{l2\M), Cook's distance satisfies 

(2.13) E[h{CB{Ii))\M] > E[h{CB{l2))\M] 

and holds for all increasing functions h{-). In particular, we have £'[CD(/i)| 
A4] > E[CD{l2)\J^] and QcD{ii){oi\-M.) is greater than the a-quantile of 
<5cd(/2)(q^I-^) f'^^ '^'^y ^ S [0) l]j where QcD(/)('^l-^) denotes the a-quantile 
of the distribution of CD(I) for any subset I . 

Proposition 1 formally characterizes the fundamental issue of Cook's dis- 
tance. Specifically, for any two subsets Ii and I2 with 'P{Ii\M.) > V{l2\M), 
CD(/i) has a high probability of being greater than CD(/2) when is the 
true data generator. Thus, Cook's distance for subsets with different degrees 
of perturbation are not directly comparable. More importantly, it indicates 
that CD(I) cannot be simply expressed as a linear function of P(I|A^). 
Thus, the standard solution, which standardizes CD(/) by calculating the 
ratio of CD(/) over V{I\M), is not desirable for controlling for the effect of 

r{i\M). 

2.4. Scaled Cook's distances. We focus on developing several scaled 
Cook's distances for /, denoted by SCD(I), to detect relatively influential 
subsets, while accounting for the degree of perturbation V{I\A4). Since we 
have characterized the stochastic relationship between 'P{I\M) and CD(/) 
when M is the true data generator, we are interested in reducing the effect 
of the difference among V{I\M) for different subsets I on the magnitude of 
CD(/). A simple solution is to calculate several features (e.g., mean, median, 
or quantiles) of CD(/) and match them across different subsets under the 
assumption that M is the true data generator. Throughout the paper, we 
consider two pairs of features including (mean, Std) and (median, Mstd), 
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where Std and Mstd, respectively, denote the standard deviation and the 
median standard deviation. By matching any of the two pairs, we can at 
least ensure that the centers and scales of the scaled Cook's distances for 
different subsets are the same when M is the true data generator. 

We introduce two scaled Cook's distance measures, called scaled Cook's 
distances, as follows. 

Definition 1. The scaled Cook's distances for matching (mean, Std) 
and (median, Mstd) are, respectively, defined as 

CD(J)-ii;[CD(/)|.M] 
Std[CD(/)|A^] 

_ CD(/)-QcD(/)(0-5|.M) 
- Mstd[CD(/)|A^] ' 
where both the expectation and the quantile are taken with respect to M. 

We can use SCDi(/) and SCD2(/) to evaluate the relatively influential 
level for different subsets /. A large value of SCDi(/) [or SCD2(/)] indicates 
that the subset / is relatively influential. Therefore, for any two subsets Ii 
and I2, the probability of observing the event SCD(/i) > SCD(/2) and that 
of the event SCD(/i) < SCD(/2) should be reasonably close to each other. 
Thus, the SCD(/) are roughly comparable. Note that the scaled Cook's 
distances do not provide a "per unit" effect of removing one observation 
within the set /, whereas they measure the standardized influential level of 
the set / when M is true. Moreover, the standardization in Definition 1 still 
implies that higher than average values of CD(/) still correspond with high 
positive values of SCD(/), even though for some deletions, it is possible for 
SCD(/) to be negative unlike CD(/). 

The next task is how to compute E[CB{I)\M], Std[CD(/)|7W], 
Mstd[CD(/)|A^] and QcD(/)(0-5|A^) for each subset / under the assump- 
tion that 7W is the true data generator. Computationally, we suggest using 
the parametric bootstrap to approximate the four quantities of CD(/) as 
follows: 

Step 1. We use = {p{Y\0)} to approximate the model 7W = {p(Y\9^)}, 
generate a random sample from p(Y\6) and then calculate CD(/)''*) = 
i^i(/,7W, V^) for each s and each subset /. 

Step 2. By repeating Step 1 S times, we can obtain a sample {CD(/)('*) : s = 
1, . . . , 5} and then we use its empirical mean CD(/) = Ef=iCD(/)W/5 to 
approximate £'[CD(/)|A^]. 

Step 3. We approximate Std[CD(/)|A^], QcD(/)(0.5|A^) and Mstd[CD(/)|A^] 
by using their corresponding empirical quantities of {CD (/)(*) : s = 1, . . . , S}. 

In this process, we have used M to approximate A4 [24] and simulated 
data Y'^ from M in the standard parametric bootstrap method. If Y truly 
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comes from M, then the simulated data Y'^ should resemble Y. Since 6 
is a consistent estimate of 0*, E[Fi{I,M,Y)\M]^ E[Fi{I,M,Y)\M] and 
thus CD(I) is a consistent estimate of E[Fi{I , Ai Similar arguments 
hold for the other three quantities of CD(/). In Steps 2 and 3, we can use 
a moderate S, say 5 = 100, in order to accurately approximate all four 
quantities of CD(/). According to our experience, such an approximation is 
very accurate, even for small n. See the simulation studies in Section 3.1 for 
details. However, for most statistical models with complex data structures, 
it can be computationally intensive to compute for each Y'^. We will 
address this issue in Section 2.6. 

As an illustration, we consider how to calculate SCDi (/) for any subset / 
in the linear regression model. 

Example 1 (Continued). In (2.11), since all CD(/) share ct^, we re- 
place by cr^. Thus, we approximate CD (I) by CD*(/) = e-^W^e/cr^, 
where e = (ei ,...,£„)-'" ~ A^(0, crjl„,) and 



To compute SCDi(/), we just need to calculate the two quantities i?[CD,= (I)|A^] 
and Std[CD*(/)|7W] . Since CD*(I) is a quadratic form, it can be shown that 



where denotes the expectation taken with respect to X. 

2.5. Conditionally scaled Cook's distances. In certain research settings 
(e.g., regression), it may be better to perform influence analysis while fix- 
ing some covariates of interest, such as measurement time. For instance, in 
longitudinal data, if different subjects can have different numbers of mea- 
surements and measurement times, which are not covariates of interest in 
an influence analysis, it may be better to eliminate their effect in calculating 
Cook's distance. We are interested in comparing Cook's distance relative to 
V{I\Ai) while fixing some covariates. 

To eliminate the effect of some fixed covariates, we introduce two condi- 
tionally scaled Cook's distances as follows. 

Definition 2. The conditionally scaled Cook's distances (CSCD) for 
matching (mean, Std) and (median, Mstd) while controlling for Z are, re- 
spectively, defined as 



W, = (I, - HMiIr,(^j) - HiY^Hiil^t^i^ - Hi)-^Uj{ln - H,). 



E[CB,{I)\M] 



E{tT[{I^^i)-Hir']\Mx}-n{I), 
Yar{tT[{I^^i)-Hi)-']\Mx} 
+ 2£;{tr[{(I„(,) - Hjr^Hif]\Mx} 



Var[CD,(/)|M] 



CSCDi(I,Z) 



CD{I) - E[CB{I)\M,Z] 
Std[CD(/)|7W,Z] 
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CSCD2(/, Z) - Mstd[CD(/)|A^,Z] ' 

where Z is the set of some fixed covariates in Y and the expectation and 
quantiles are taken with respect to A4. given Z. 

According to Definition 2, these conditionally scaled Cook's distances 
can be used to evaluate the relative influential level of different subsets / 
given Z. Similar to SCDi(/) and SCD2(/), a large value of CSCDi(/, Z) 
[or CSCD2(/, Z)] indicates a large influence of the subset / after controlling 
for Z. It should be noted that because Z is fixed, the CSCDfc(/, Z) do not 
reflect the influential level of Z, and the CSCDfc(/, Z) may vary across dif- 
ferent Z. The conditionally scaled Cook's distances measure the difference 
of the observed influence level of the set / given Z to the expected influence 
level of a set with the same data structure when M is true and Z is fixed. 

The next problem is how to compute £^[CD(/)|A^, Z], Std[CD(/)|A^, Z], 
QcD(/)(0-5|Al,Z) and Mstd[CD(/)| A^, Z] for each subset / when M is the 
true data generator and Z is fixed. Similar to the computation of the scaled 
Cook's distances, we can essentially use almost the same approach to ap- 
proximate the four quantities for CSCDi(/, Z) and CSCD2(/, Z). However, 
a slight difference occurs in the way that we simulate the data. Specifically, 
let be the data Y with Z fixed. We need to simulate random samples Y^ 
from Mz = {p(Yz|Z, 6)] and then calculate CD(/)(*) = Fi{I,Mz, (Y|, Z)) 
for each subset /. 

As an illustration, we consider how to calculate CSCDi(/, Z) for any 
subset / in the linear regression model. 

Example 1 (Continued). We set Z = X to calculate CSCDi(/, Z). We 
need to compute E[CB^{I)\M,Z] and Std[CD,(/)|X, Z]. Since CD*(/) is 
a quadratic form, it is easy to show E[CD^:{I)\A4] = tr [(!„(/) — Hi)~^] — n{I) 
and Var[CD,(/)|7W] = 2tr[{(I,(^) - Hi)~^Hi}^]. 

2.6. First-order approximations. We have focused on developing the 
scaled Cook's distances and their approximations for the linear regression 
model. More generally, we are interested in approximating the scaled Cook's 
distances for a large class of parametric models for both independent and 
dependent data. 

We obtain the following theorem. 

Theorem 3. If Assumptions A2-A5 in the Appendix hold and n{I)/n — )■ 
7 E [0,1), where n{I) denotes the number of observations of I , then we have 
the following results: 

(a) LetYn{e) = -dl\ogp{Y\e), ^{6) = de\ogp{Y i\Y^i^,e) and si{e) = 
— (9g logp(Y/|Y[7],0), CD(/) can be approximated by 

(2.14) CD(/) = ii{ef[¥n{e) - s,(^)]-'F„(0)[F„(0) - si{e)]-Hi{e)- 
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(b) E[CD{I)\M] « tT{{E[Fn{e)\M] - E[sj{6)\M]}-^E[sii9)\M]y, 

(c) E[CBiI)\M,Z] « tri{E[Fnie)\M,Z] - E[sji6)\M,Z]}-^ E[sj{e)\ 
M,Z]). 

Theorem 3(a) establishes the first-order approximation of Cook's distance 
for a large class of parametric models for both dependent and independent 
data. This leads to a substantial savings in computational time, since it 
is computationally easier to calculate ii{0), F„(0) and si{6) compared to 
CD(/). Theorem 3(b) and (c) give an approximation of E[CD(/)|A1] and 
E[CD(/)|A1, Z], respectively. Generally, it is difficult to give a simple ap- 
proximation to Var[CD(/)| A^] and Var[CD(/)|A^, Z], since it involves the 
fourth moment of ii{0), which does not have a simple form. 

Based on Theorem 3, we can approximate the scaled Cook's distance 
measures as follows. 

Step (i). We generate a random sample from p(Y|Z,0) and calculate 
CD(/) based on the simulated sample and fixed Z, denoted by CD(/)*. 
Explicitly, to calculate CD(/)*, we replace Y in f/(0), F„(0), and sj{6) 
by Y*. The computational burden involved in computing CD(/)** is very 
minor. 

Compared to the exact computation of the scaled Cook's distances, we 
have avoided computing the maximum likelihood estimate of 6 based on Y* , 
which leads to great computational savings in computing CD(/)'* for large S, 
say S > 100. Theoretically, since is a consistent estimate of 0*, i?[CD(/)|7W] 
is a consistent estimate of £^[CD(/)|7W]. Compared with reestimating 6 for 
each Y**, a drawback of using 6 in calculating CD(/)* is that CD(/)* does 
not account for the variability in 6. Similar arguments hold for the other 
three quantities of CD(/). 

Step (ii). By repeating Step (i) S times, we can use the empirical quantities 
of {Cl5(/)* : s = 1, . . . , 5} to approximate E[CD{I)\M,Z], Std[CD(/)|A^ , Z], 
QcD(/)(0-5|A^, Z) and Mstd[CD(/)|7W, Z]. Subsequently, we can approxi- 
mate CSCDi(/, Z) and CSCD2(/, Z) and determine their magnitude based 
on CD(/)^ 

For instance, let M[CD(/)] and Std[CD(/)] be, respectively, the sample 
mean and standard deviation of {CD(/)* : s = 1, . . . , S}. We calculate 

csc5.(j.z) = '^W-j^'^W". 

Std[CD(/)] 

c^B,ii,zy = ^^^'L-3^™. 

Std[CD(/)] 
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We use CSCDi(/, Z) to approximate CSCDi(/, Z) and then compare 
CSCDi(/, Z) across different / in order to determine whether a specific 
subset I is relatively influential or not. Moreover, since CSCDi(/, Z)* can 
be regarded as the "true" scaled Cook's distance when p(Y|Z,0) is true, 
we can either compare CSCDi(I, Z) with CSCDi(I, Z)'' for all subsets / 

and s or compare CSCDi(/, Z) with CSCDi(/, Z)'' for all s. Specifically, we 
calculate two probabilities as follows: 

5 ^-^^ 

(2.15) Pa(/,Z) = ^l(CSCDi(/,Z)'' < CSCDi(/,Z))/5, 

s=l 

(2.16) Pb(/. z) = E ^ CS(55, (/, Z)) 



s=l 



S X #(/) 



where #(/) is the total number of all possible sets, and l(-) is an indicator 
function of a set. We regard a subset / as influential if the value of Pa{I, Z) 
[or Pb(/, Z))] is relatively large. Similarly, we can use the same strategy to 
quantify the size of CSCD2(/,Z), SCDi(/) and SCD2(/). _ 

Another issue is the accuracy of the first-order approximation CD(/) to 
the exact CD(/). For relatively influential subsets, even though the accuracy 
of the flrst-order approximation may be relatively low, CD(I) can easily 
pick out these influential points. Thus, for diagnostic purposes, the first- 
order approximation may be more eff^ective at identifying influential subsets 
compared to the true Cook's distance. We conduct simulation studies to 
investigate the performance of the first-order approximation CD(/) relative 
to the exact CD(/). Numerical comparisons are given in Section 3. 

We consider cluster deletion in generalized linear mixed models (GLMM). 

Example 2. Consider a dataset, that is, composed of a response yij, 
covariate vectors 'X-ijip x 1) and Cij{pi x 1), for observations j = l,...,mj 
within clusters i = 1, . . . ,n. The GLMM assumes that conditional on a x 1 
random variable bj, yij follows an exponential family distribution of the 
form [18] 

(2.17) p{yij\hi) =exjp{a{Ty^[yijr]ij - b{r]ij)] + c{yij , t)} , 

where rjij = k{:x.fjf3 + cfjhi) in which /3 = {f3i, . . . ,f3p)'^ and k{-) is a known 
continuously differentiable function. The distribution of bj is assumed to be 
A^(0, S), where S = '^{'y) depends on a p2 x 1 vector 7 of unknown variance 
components. In this case, we fix all covariates Xjj and Cij and all rrii and 
include them in Z. For simplicity, we fix (')',r) at an appropriate estimate 
(7,f) throughout the example. 
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We focus here on cluster deletion in GLMMs. After some calculations, the 
first-order approximation of CD(/j) for deleting the ith cluster is given by 

(2.18) CD(/i) = d^ii0f[Fn0) - fim-^FnimFniP) - HPT'dpiiiP), 

where = {(z, 1), . . . , {i, rrii)}, is the log-likelihood function for the ith 
cluster, fi(/3) = -djii{l3) and F„,(/3) = X^Li ^dP)- Note that 

Then, conditional on all the covariates and {mi, . . . , m„} in Z, we can 
show that E[CD(/i)|A^,Z] can be approximated by tr({E[F„(/3)|A^, Z] - 
E[f,,(/3)|7W, Z]}~^E[f((;9)|Al, Z]) when M is true. Moreover, we may approx- 
imate Var[CD(/j)|7W, Z] by using the fourth moment of dpii{f3^). It is not 
straightforward to approximate (5cD(/i)(0-5|A^, Z) and Mstd[CD(/i)|7W, Z]. 
Computationally, we employ the parametric bootstrap method described 
above to approximate the conditionally scaled Cook's distances CSCDi(/j, Z) 
and CSCD2(/»,Z). 

3. Simulation studies and a real data example. In this section, we illus- 
trate our methodology with simulated data and a real data example. We also 
include some additional results in the supplemental article [27]. The code 
along with its documentation for implementing our methodology is available 
on the first author's website at http : //www. bios .unc . edu/research/bias/ 
software .html. 

3.1. Simulation studies. The goals of our simulations were to examine 
the finite sample performance of Cook's distance and the scaled Cook's dis- 
tances and their first-order approximations for detecting influential clusters 
in longitudinal data. We generated 100 datasets from a linear mixed model. 
Specifically, each dataset contains n clusters. For each cluster, the random 
effect bi was first independently generated from a A^(0,cr^) distribution and 
then, given 6j, the observations yij {j = 1, . . . ,mi;i = 1, . . . ,n) were indepen- 
dently generated as yij ~ iV(x^/3 + bi,ay) and the rrii were randomly drawn 
from {1, ... ,5}. The covariates Xjj were set as (1, lij, tjj)^, among which tij 
represents time, and Ui denotes a baseline covariate. Moreover, tij = log(j) 
and the Uj's were independently generated from a A^(0, 1) distribution. For 
all 100 datasets, the responses were repeatedly simulated, whereas we gen- 
erated the covariates and cluster sizes only once in order to fix the effect of 
the covariates and cluster sizes on Cook's distance for each cluster. The true 
value of = 0"^, 0"^)^ was fixed at (1, 1, 1, 1, 1)"^. The sample size n was 
set at 12 to represent a small number of clusters. 

For each simulated dataset, we considered the detection of infiuential 
clusters [4]. We fit the same linear mixed model and used the expectation- 
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maximization (EM) algorithm to calculate and for each cluster /. We 
treated {ab,(7y) as nuisance parameters and /3 as the parameter vector of 
interest. We calculated the degree of the perturbation 'Pi{i}\.M) for deleting 
each subject {i} while fixing the covariates, and then we calculated the 
conditionally scaled Cook's distances and associated quantities. Let Xj be 
an mj X 3 matrix with the jth row being x?'^-. It can be shown that for the 
case of fixed covariates, we have 

(3.1) r{{i}\M) = 0.5tr{xf i?,(a)"ix,i?^[(/3 - /3J(/3 - P,f]}, 

where is taken with respect to p(/3|/9^, G~^) and Ri{cx) = (Tylm^ +0"^!®^, 
in which a = {(rl^cTy)'^ and 1^^ is an x 1 vector with all elements equal 
to one. We set G~l = ELi and substituted by fi. 

We carried out three experiments as follows. The first experiment was to 
evaluate the accuracy of the first-order approximation to CD (I). The explicit 
expression of CD(/) is given in Example S2 of the supplementary document. 
We considered two scenarios. In the first scenario, we directly simulated 100 
datasets from the above linear mixed model. In the second scenario, for 
each simulated dataset, we deleted all the observations in clusters n — 1 
and n and then reset (mi,6i) = (1,4) and {mn,bn) = (5,3) to generate y^j 
for i = l,n and all j according to the above linear mixed model. Thus, the 
new first and nth clusters can be regarded as influential clusters due to the 
extreme values of bi and 6„. Moreover, the number of observations in these 
two clusters is unbalanced. We calculated CD(/) and CD(/), the average 
CD(/), and the biases and standard errors of the differences CD(/) — CD(/) 
for each cluster {i} (Table 1). 

Inspecting Table 1 reveals three findings as follows. First, when no in- 
fluential cluster is present in the first scenario, the average CD(/) is an 
increasing function of V{I\M), whereas it is only positively proportional to 
the cluster size n{I) with a correlation coefficient of 0.83. This result agrees 
with the results of Proposition 1. Second, in the second scenario, the aver- 
age CD (I) for the true "good" clusters is positively proportional to P(/|7W) 
with a correlation coefficient of 0.76, while that for the infiuential clusters 
is associated with both V{I\A4) and the amount of influence that we intro- 
duced. Third, for the true "good" clusters, the first-order approximation is 
very accurate and leads to small average biases and standard errors. Even 
for the influential clusters, CD(/) is relatively close to CD(/). For instance, 
for cluster {n}, the bias of 0.19 is relatively small compared with 0.78, the 
mean of CD({?i}). 

In the second experiment, we considered the same two scenarios as the first 
experiment. Specifically, for each dataset, we approximated -E'[CD(I)|A^, Z] 
and Std[CD(/)|7W, Z] by setting 5 = 200 and using their empirical val- 
ues, and calculated their first approximations M[CD(/)] and Std[CD(/)]. 
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Across all 100 data sets, for each cluster /, we computed the averages 
of E[CI){I)\M,Z] and Std[CD(7)|A^, Z], and^the biases and standard er- 
rors of^the differences S[CD(/)|7W, Z] - M[CD(/)] and Std[CD{I)\M,Z] - 
Std[CD(/)]. 

Table 1 shows the results for each scenario. First, in both scenarios, the 
average £'[CD(/)|A^ , Z] is an increasing function of V{I\J^), whereas it is 
only positively proportional to the cluster size n{I) with a correlation coeffi- 
cient (CC) of 0.80. This is in agreement with the results of Proposition 1. The 
average of Std[CD(/)| Al, Z] are positively proportional to (CC = 0.76) 
and P(/|A^) (CC = 0.99). Second, for all clusters, the first-order approxi- 
mations of £'[CD(/)|A^, Z] and Std[CD(/)|A^, Z] are very accurate and lead 
to small average biases and standard errors. 

The third experiment was to examine the finite sample performance of 
Cook's distance and the scaled Cook's distances for detecting influential 
clusters in longitudinal data. We considered two scenarios. In the first sce- 
nario, for each of the 100 simulated datasets, we deleted all the observations 
in cluster n and then reset m„ = 1 and varied 6.„ from 0.6 to 6.0 to gener- 
ate yn,j according to the above linear mixed model. The second scenario is 
almost the same as the first scenario, except that we reset m„ = 10. Note 
that when the value of bn is relatively large, for example, bn = 2.5, the nth 
cluster is an influential cluster, whereas the nth cluster is not influential 
for small bn- A good case-deletion measure should detect the nth cluster 
as truly influential for large whereas it does not for small 6„. For each 
data set, we approximated CSCDi(/,Z), CSCD2(/,Z), CSCDi(/,Z) and 

CSCD2(/, Z) by setting 5 = 100. Subsequently, we calculated P^(/, Z) and 
Pb{I,Z) in (2.15) andPc(/,Z) = X:/^{„|l(CD(/)<CD({n}))/(n-l). Fi- 
nally, across all 100 datasets, we calculated the averages and standard errors 
of all diagnostic measures for the nth cluster for each scenario. 

Inspecting Figure 1 reveals some findings as follows. First, deleting the 
nth cluster with 10 observations causes a larger effect than that with 1 
observation [Figure 1(a) and (e), (d) and (h)]. As expected, the distributions 

of CD({n}) and CSCDi(/, Z) shift up as 5„ increases [Figure 1(a), (b), (e) 
and (f)]. Second, in the first scenario, CD({n}) is stochastically smaller than 
most other CD(/)s, when the value of bn is relatively small [Figure 1(d)]. 
However, in the second scenario, CD({n}) is stochastically larger than most 
other CD(/)s [Figure 1(h)] even for small values of 6„. Specifically, when 
rUn = 1, the average Pc{{n}, Z) is smaller than 0.4 as 6„ = 0.6 and 6.„ = 1.2, 
whereas when m„ = 10, the average Pc{{n}, Z) is higher than 0.75 even as 
bn = 0.6. In contrast, in the two scenarios, the value of Pb({?i},Z) is close 
to 0.5 as bn = 0.6 [Figure 1(d) and (h)]. It indicates that the cluster size 

does not have a big effect on the distribution of CSCDi(/, Z) [Figure 1(c) 
and (g)]. 



18 



H. ZHU, J. G. IBRAHIM AND H. CHO 



Table 1 

Selected results from simulation studies for n = 12 and the two scenarios: rrii, 'P{{i}\A4) , 
M, SD, Mdif (xlQ-'^) and SDdif (xlQ'^ ) of the three quantities CD(/), £'[CD(I)|>1, Z] 
and Std[CD(/)|A^, Z] . denotes the cluster size of subset {i}; P({i}|A^) denotes the 
degree of perturbation; M denotes the mean; SD denotes the standard deviation; Mdif and 
SDdif, respectively, denote the mean and standard deviation of the differences between 
each quantity and its first-order approximation. In the first scenario, all observations 
were generated from the linear mixed model, while in the second scenario, two clusters 
were influential clusters and highlighted m bold. For each case, 100 simulated datasets 
were used. Results were sorted according to the degree of perturbation for each cluster 



Scenario I Scenario II 



mi 


•p({i}|A4) 


M Mdif 


SD 


SDdif 


m-i V{{i}\M.) 


M 


Mdif 


SD 


SDdif 


1 


0.10 


0.11 


0.01 


0, 


.09 


CD(/) 
0.03 1 


0.08 


0.37 


1.01 


0.18 


0.18 


2 


0.11 


0.12 


0.32 


0, 


.12 


0.15 


2 


0.11 


0.10 


0.08 


0.09 


0.12 


2 


0.11 


0.15 


1.24 


0, 


.18 


0.64 


1 


0.11 


0.08 


0.02 


0.11 


0.02 


2 


0.13 


0.18 


0.87 


0, 


.19 


0.36 


2 


0.13 


0.13 


0.08 


0.12 


0.12 


2 


0.15 


0.17 


0.25 


0, 


.19 


0.20 


2 


0.16 


0.13 


-0.13 


0.12 


0.08 


3 


0.16 


0.23 


0.55 


0, 


.19 


0.50 


2 


0.20 


0.20 


0.08 


0.19 


0.12 


2 


0.19 


0.26 


-0.02 


0, 


.32 


0.25 


3 


0.23 


0.21 


-0.06 


0.18 


0.22 


3 


0.22 


0.34 


2.97 


0, 


.35 


0.99 


4 


0.25 


0.23 


0.37 


0.23 


0.26 


4 


0.27 


0.41 


3.35 


0, 


.38 


1.77 


5 


0.28 


0.78 


18.59 


0.61 


4.71 


5 


0.40 


0.70 


5.43 


0, 


.60 


1.90 


5 


0.37 


0.38 


0.90 


0.32 


0.46 


4 


0.57 


1.15 


1.57 


1, 


.29 


1.73 


5 


0.54 


0.70 


1.32 


0.68 


0.82 


5 


0.60 


1.21 


3.62 


1, 


.49 


1.62 


4 


0.56 


0.65 


1.06 


0.69 


0.54 


1 


0.10 


0.12 


0.22 


0, 


.02 


£;[CD(/)|A1,Z] 
0.05 1 


0.08 


0.09 


0.43 


0.01 


0.04 


2 


0.11 


0.12 


0.41 


0, 


.01 


0.03 


2 


0.11 


0.12 


0.45 


0.02 


0.04 


2 


0.11 


0.13 


0.46 


0, 


.02 


0.04 


1 


0.11 


0.13 


0.09 


0.02 


0.03 


2 


0.12 


0.15 


0.40 


0, 


.02 


0.07 


2 


0.13 


0.15 


0.38 


0.02 


0.04 


2 


0.15 


0.17 


0.34 


0, 


.03 


0.08 


2 


0.16 


0.18 


0.26 


0.02 


0.04 


3 


0.16 


0.18 


0.77 


0, 


.02 


0.08 


2 


0.20 


0.23 


0.12 


0.03 


0.05 


2 


0.19 


0.22 


0.21 


0, 


.04 


0.09 


3 


0.23 


0.27 


0.46 


0.03 


0.07 


3 


0.22 


0.26 


0.62 


0, 


.04 


0.09 


4 


0.25 


0.29 


1.13 


0.03 


0.13 


4 


0.26 


0.32 


1.63 


0, 


.03 


0.15 


5 


0.28 


0.36 


1.94 


0.04 


0.18 


5 


0.40 


0.55 


2.58 


0, 


.07 


0.29 


5 


0.37 


0.48 


1.86 


0.05 


0.18 


4 


0.57 


0.97 


2.21 


0, 


.12 


0.21 


5 


0.53 


0.82 


4.26 


0.10 


0.34 


5 


0.60 


1.03 


5.87 


0, 


.16 


0.99 


4 


0.56 


0.93 


1.64 


0.11 


0.17 


1 


0.10 


0.18 


1.48 


0, 


.04 


Std[CD(/)|X,Z] 
0.20 1 


0.08 


0.13 


1.05 


0.04 


0.22 


2 


0.11 


0.14 


1.16 


0, 


.03 


0.10 


2 


0.11 


0.14 


1.18 


0.03 


0.12 


2 


0.11 


0.15 


1.37 


0, 


.03 


0.16 


1 


0.11 


0.18 


0.78 


0.04 


0.10 


2 


0.13 


0.18 


1.72 


0, 


.05 


0.35 


2 


0.13 


0.18 


1.15 


0.03 


0.13 


2 
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Fig. 1. Simulation results from 100 datasets simulated from a linear mixed model in the 
two scenarios. The first row corresponds to the first scenario, in which mi2 = 1 and 612 
varies from 0.6 to 6.0. The second row corresponds to the second scenario, in which 
mi2 = 10 and 612 varies from 0.6 to 6.0. Panels (a) and (e) show the box plots of Cook's 
distances as a function of bi2; panels (b) and (f) show the box plots o/CSCDi(7,Z) as 
a function of bi2; panels (c) and (g) show the box plots of Pb{I,'Z) as a function 0/612; 
panels (d) and (h) show the mean curve of Pb{I ,Z) based on CSCDi(J,Z) (red line) and 
the mean curve of Pc{I,'Zi) based on CD(J) (green line) as functions 0/&12. 



3.2. Yale infant growth data. The Yale infant growth data were col- 
lected to study whether cocaine exposure during pregnancy may lead to 
the maltreatment of infants after birth, such as physical and sexual abuse. 
A total of 298 children were recruited from two subject groups (cocaine 
exposed group and unexposed group). One feature of this dataset is that 
the number of observations per children irii varies significantly from 2 to 
30 [21, 22]. The total number of data points is X^iLi?7Zj = 3176. Following 
Zhang [26], we considered two linear mixed models given by yij = iKfj/S + Sij, 
where yij is the weight (in kilograms) of the jth visit from the ith. sub- 
ject, = (1, dij, (d,j - 120)+, {dij - 200)+, {gi - 28)+, dijigi - 28)+, {dij - 
60)+{g^ - 28)+, {dij - 490)+(5i - 28)+ , s^dij , s^{dij - 120)+)^, in which dij 
and gi (days) are the age of visit and gestational age, respectively, and Si 
is the indicator for gender. In addition, we assumed Si = {£i,i, ■ ■ ■ ,£i,mi)'^ ~ 
Nm.{0,Ri{a)), where ct is a vector of unknown parameters in Ri{a.). We 
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Fig. 2. YaZe infant growth data. Panel (a) presents the line plot of infant weight against 
age, in which the observations of subject 269 are highlighted; panel (b) shows the cumulative 
residual curve versus age, m which the observed cumulative residual curve is highlighted in 
blue; and panels (c) and (d), respectively, present age versus raw residual and age versus 
standardized residual for cluster deletion. 

first considered Ri{a) = aolmi +ckilmi ■ We refer to tliis model as model Mi. 
Then, it is assumed that variance and autocorrelation parameters are, re- 
spectively, given by = exp(Qo + aid + 02^^ + CKsd^) and /o(/) = 04 + 05?, 
where / is the lag between two visits. We refer to this model as model M2. 

We systematically examined the key assumptions of models Mi and M2 
as follows. 

(i) We presented a cumulative residual plot and calculated the cumulative 
sums of residuals over the age of the visit to test i?[yjj|xjj] =ji.Jj(3 [17], 
whose p- value is greater than 0.543. It may suggest that the mean structure 
is reasonable. The cumulative residual plot is given in Figure 2(b). 

(ii) For model Mi , inspecting the plot of raw residuals rij = i/ij — ^JjP 
against age in Figure 2(c) reveals that the variance of the raw residuals 
appears to increase with the age of visit. As pointed by Zhang [26], it may 
be more sensible to use model M2. Let fj = (ri,i, . . . = -R'i(ct)~^^^i'i 
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be the vector of standardized residuals of M2, where = (rj^i, . . . ,ri^mi)'^ ■ 
The standardized residuals under M2 do not have any apparent structure 
as age increases [Figure 2(d)]. 

(iii) Under each model, we calculated CD(/) for each child [4]. We treated f3 
as parameters of interest and all elements of a as nuisance parameters. For 
model Ml, we obtained a strong Pearson correlation of 0.363 between Cook's 
distance and the cluster size. This indicates that the bigger the cluster size, 
the larger the Cook's distance measure. Figure 4(b) highlights the top ten 
influential subjects. Compared with model Mi, we observed similar findings 
by using CD(/) under model M2, which were omitted for space limitations. 

There are several difficulties in using Cook's distance under both mod- 
els Ml and M2 [3, 4, 7, 19]. First, cluster size varies significantly across 
children, and deleting a larger cluster may have a higher probability of 
having a larger influence as discussed in Section 2.3. For instance, we ob- 
serve (m285,CD({285})) = (8,0.738) and (m274, CD({274})) = (22,1.163). 
A larger CD({274}) can be caused by a larger 771274 = 22 and/or influen- 
tial subject 274, among others. Since 771274 is much larger than 777285, it is 
difficult to claim that subject 274 is more influential than subject 285. Sec- 
ond, there is no rule for determining whether a speciflc subject is influential 
relative to the fitted model. Specifically, it is unclear whether the subjects 
with larger CD({7}) are truly infiuential or not. Third, inspecting Cook's 
distance solely does not seem to delineate the potential misspecification of 
the covariance structure under model Mi. We will address these three diffi- 
culties by using the new case-deletion measures. 

(iv) Under each model, we calculated 'P{{i}\.M) for deleting each sub- 
ject {i} for fixed covariates, and then we calculated the conditionally scaled 
Cook's distances and associated quantities. We then used 1000 bootstrap 
samples to approximate CSCDi(/, Z) and CSCD2(/, Z). Subsequently, we 
calculated Pa(/,Z) and Pb(/,Z) in (2.15). 

We observed several findings. First, under model Mi, we observed a strong 
positive correlation between T'{{i}\Ai) and rui [Figure 3(a)]. Second, even 
though 70,269 = 12 is moderate, subject 269 has the largest degree of per- 
turbation. Inspecting the raw data in Figure 2(a) reveals that subject 269 
is of older age during visits compared with other subjects. Third, we also 
observed a strong positive correlation between V{{i}\A4) and Cook's dis- 
tance [Figure 3(b)], which may indicate their stochastic relationship as dis- 
cussed in Section 2.3. Fourth, we observed a positive correlation between 
Cook's distance and the conditionally scaled Cook's distance [Figure 3(b) 
and (c)], but their levels of infiuence for the same subject are quite dif- 
ferent. For instance, the magnitude of CSCDi({269}, Z) is only moderate, 
whereas CDi({269},Z) is the highest one. We observed similar findings un- 
der model A^2 and presented some findings in Figure 3(d) and (e). 
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Fig. 3. Fa/e infant growth data. Panel (a) shows rui versus 7'(7|A^i), in which the ten 
subjects with the largest values of degree of perturbation or cluster size are highlighted; 
panel (b) shows 'P(/|A^i) versus CD(/), in which the top ten influential subjects are 
highlighted; panel (c) shows 'P(7|A1i) versus CSCDi(/,Z), in which the top eleven in- 
fluential subjects are highlighted; and panels (d), (e) and (f), respectively, show VillM), 
CSCDi(J,Z) and Pb(I,'Z) for models Mi and M2- 



We used Z) to quantify whether a specific subject is influential rel- 

ative to the fitted model Adi [Figure 3(f)]. For instance, since CD({246}) = 
0.253, it is unclear whether subject 246 is influential or not according to CD, 
whereas we have CSCDi({246}, Z) = 21.443 and Pb({246},Z) = 1.0. Thus, 
subject 246 is really influential after eliminating the effect of the cluster size. 
Moreover, it is difficult to compare the influential levels of subjects 274 and 
285 using CD. All of the conditionally scaled Cook's distances and associ- 
ated quantities suggest that subject 274 is more influential than subject 285 
after eliminating the degree of perturbation difference. We observed similar 
findings under model M2 and omitted them due to space limitations. See 
Figure 3(d) and (e) for details. 

We compared the goodness of fit of models Aii and 7W2 to the data by us- 
ing the proposed case-deletion measures. First, inspecting Figure 3(d) reveals 
a strong similarity between the degrees of perturbation under models A4i 
and A^2 for all subjects. Second, by using the conditionally scaled Cook's dis- 
tance, we observed different levels of influence for the same subject under A^i 
and M2. For instance, CSCDi(/, Z) identifies subjects 246, 141, 109, 193 and 
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31 as the top five influential subjects under Mi, whereas it identifies sub- 
jects 274, 217, 90, 109 and 289 as the top ones under A^2- Finally, examining 
Pb(/, Z) reveals a large percentage of influential points for model Mi, but 
a small percentage of influential points for model A42', see Figure 3(f) for 
details. This may indicate that model M2 outperforms model Mi. Further- 
more, although we may develop goodness-of-fit statistics based on the scaled 
Cook's distances and show that model M2 outperforms model Mi, this will 
be a topic of our future research. 

In summary, the use of the new case-deletion measures provides new in- 
sights in real data analysis. First, V{I\M) explicitly quantifies the degree of 
perturbation introduced by deleting each subject. Second, CSCDfc(/, Z) for 
k = 1,2 explicitly account for the degree of perturbation for each subject. 
Third, Pb{I, Z) allows us to quantify whether a specific subject is influential 
relative to the fitted model. Fourth, inspecting Pb(/, Z) and CSCDfc(/, Z) 
may delineate the potential misspecification of the covariance structure un- 
der model Ml. 

4. Discussion. We have introduced a new quantity to quantify the de- 
gree of perturbation and examined its properties. We have used stochastic 
ordering to quantify the relationship between the degree of the perturbation 
and the magnitude of Cook's distance. We have developed several scaled 
Cook's distances to address the fundamental issue of deletion diagnostics 
in general parametric models. We have shown that the scaled Cook's dis- 
tances provide important information about the relative influential level of 
each subset. Future work includes developing goodness-of-fit statistics based 
on the scaled Cook's distances, developing Bayesian analogs to the scaled 
Cook's distances, and developing user-friendly R code for implementing our 
proposed measures in various models, such as survival models and models 
with missing covariate data. 

APPENDIX 

The following assumptions are needed to facilitate the technical details, 
although they are not the weakest possible conditions. Because we develop 
all results for general parametric models, we only assume several high-level 
assumptions as follows. 

Assumption A2. O^j] for any / is a consistent estimate of 0* € 0. 

Assumption A3. All p(Y^jj\0) are three times continuously differen- 
tiable on and satisfy 

iogp(Y[,]|0) = iogp(Y[,]|0,) + A{efj^^[i]{e,) 
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in which \R[j]{6)\ = Op{l) umfoimly ior all 6 e B{e^,6on^^/'^) = {9 : ^/E\\e - 
04 < So}, where A(6>) = 0-0,, Jn,[i]{0) = dg\ogp{Yyi^\e) and F„,[j](6>,) = 

dl\ogp{Yyi^\e). 

Assumption A4. For any / and Z, supeg5(g^ n-^r25^)n~^'^Jn\i\{0) = 
sup \\Fr,,ii]{0)-E[Fi{e)\M,Z]\\=Op{V^), 

eG_B(6»,,n-V25o) 

sup n-iF„.[,](0) - F„,[,](0')ll = opil), 

0,6»'eB(6l*,n-i/25o) 

and < mfe6B(0,,5on-i/2) Amm(n"^F„jj](0)) < supeg5(e^ 5^„-i/2) Amax(n~^ x 
F,,[,](0))<oo. 

Assumption A5. For any set / and Z, 

sup J/(0) = Op(y<O), 

6»GB(6».,n-i/25o) 

sup ||f,(0)||=Op(n(/)), 

6leB{e.,n-i/25o) 

sup ||fK0) - i^[f7(0)|A^, Z]|| = Op(v^). 

Remarks. Assumptions A2-A5 are very general conditions and are gen- 
erahzations of some higher level conditions for the extremum estimator, such 
as the maximum likelihood estimate, given in Andrews [2]. Assumption A2 
assumes that the parameter estimates with and without deleting the ob- 
servations in the subset / are consistent. Assumption A3 assumes that the 
log-likelihood functions for any / and Yj/j admit a second-order Taylor's 
series expansion in a small neighborhood of 0*. Assumptions A4 and A5 are 
standard assumptions to ensure that the first- and second-order derivatives 
of p(Y[/]|0) andp(Y/|Y[7],0) have appropriate rates of n and nj [2, 30]. Suf- 
ficient conditions of Assumptions A2-A5 have been extensively discussed in 
the literature [2, 30]. 

Proof of Theorem 1. (P.a) directly follows from the Jensen inequal- 
ity, (2.6) and (2.7). For (P.b), if / is an empty set, then KL(Y, e\I) = and 
thus V{I\M) = 0. On the other hand, if V{I\M) = 0, then KL(Y,0|/) = 
for almost every 0. Thus, by using the Jensen inequality, we have p(Y/|Y[7-], 
6) =p(Y/|Y[/],0=f) for all gQ. Based on the identifiability condition, we 
know that / must be an empty set. Let I1.2 = /i — l2- It is easy to show that 

p{Yj, I Y[,^] , 6) = p{Yi, , Yj„, I Y[,,] , 6) = p{Yi, | Y[,,] , 0)p{Y^j^^ | Y[,^] , 6) . 
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Thus, by substituting the above equation into (2.6), we have 
ViIi\M)=V{l2\M) 

(A.l) 

+ /p(e|(..,E„.>(Y|0)log(iM^) m^, 

in which the second term on the right-hand side can be written as 



/ 



.P(Y[/2]|Y[7^], 

which yields (P.c). Based on the assumption of (P.d), we know that 

p(Y[,,] I Y[,^] , e) = p(Y,,, I Y[,^] , 6) = p{Yi„, I Y[,^.,] , 6) 

for all 6. Thus, the second term on the right-hand side of (A.l) reduces to 
'P{Ii-2\-M.)^ which finishes the proof of (P.d). □ 

Proof of Theorem 2. (a) Let Is = Ji \ /2, h is a union of two disjoint 
sets /a and I2. Without loss of generality, Hj^ can be decomposed as 

H,, - X,,(X X) X,, - X,3(X^X)-X|; ) ■ 

Let Ai^i > • • • > Ai > and A2,i > • • • > ^2,n{i2) > be the ordered 
eigenvalues of Hj^^ and Hj^, respectively, where n(/fc) denotes the num- 
ber of observations in 1^ for k = 1,2. It follows from Wielandt's eigen- 
value inequality [13] that Ai^/ > A2,i for all / = 1, . . . , n(/2). For k = 1,2, 
we define Tk^kV]^ as the spectral decomposition of Hj^ and = (J-n{ik) ~ 
Afc)~^/^r|'e/j. = {hk,i, ■ ■ ■ ) ^fc,n{/fc))"^) where Pfc is an orthnormal matrix and 
Afc = diag(Afc^i, . . . , Xk,n{ik))- I* ^e shown that for A: = 1, 2, 

1 "^^'"^ A 

hfc~iV(0,a2l„(,^)) and CD(4) = ^ ^ — 

J = l 

Since /(x) = x/(l — x) is an increasing function of x € (0, 1), this completes 
the proof of Theorem 2(a). 

Note that CD(/) = (ct^)^^ E"=i Aj(l - Aj)^^/!^, where the \j are the 
eigenvalues of Hj and h= (/ii, . . . , /i„(/))^ ~ A^(0, cr^I„(/)). Moreover, the 
distribution of A is uniquely determined by Hi. Combining h ~ iV(0, (T^I„(/)) 
with the assumptions of Theorem 2(b) yields that CD(/) and CD(/') fol- 
low the same distribution when n{I) = n[I'). Furthermore, we can always 
choose an I2 such that n{l2) = n[l2) and Ji C /2- Following arguments in 
Theorem 2(a), we can then complete the proof of Theorem 2(b). □ 
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Proof of Theorem 3. (a) It follows from a Taylor series expansion 
and Assumption A3 that 

5elogp(Y[,]|0[,]) = = aelogp(Y[,]|0) + 5|logp(Y[,]|0)(0[,] - 9), 

where 9 = + (1 — t)9 for t G [0, 1]. Combining this with Assumption A4 

and the fact that dglogp{Y\9) = dglogp{Y[j]\9) + dglogp(Yi\Y[j],9) = 0, 
we get 

-0 = [-a2logp(Y[,]|0)]-i9,logKY[,]|0)[l + Op(l)] 
(A.2) =-[-a|logp(Y[,]|0)]-i5,logp(Y,|Y[,],0)[l + Op(l)]. 

Substituting (A.2) into CD(/) = - 0)^F„(0)(0[7] - 9) completes the 
proof of Theorem 3(a). 

(b) It follows from Assumptions A2-A4 that 

9-9, = Fn{9,y^delogp{Y\9,)[l + 0^(1)] 

= Fn{9,)-^[delogp{Y[i]\9,) + dglogp{Yj\Y[j],9,)][l + Op{l)]. 

Let Ji{9) = 96ilogp(Y/|Y[j],0). Using a Taylor series expansion along with 
Assumptions A4 and A5, we get 

Ji{9) = Ji{9,) - si{9,){9 - 9,)[1 + Op(l)] 

= Jii9,) - E[si{9,)\M]{9 - 9,)[1 + Op(l)] 

(A.3) 

= ({Ip - E[sj{9)\M]Fni9,y'}Jii9,) 

-E[si{9)\M]FM-'delogp{Y[j]\9,))[l + Op{l)]. 
Since E[Ji{9,)delogp{Yij^\9,)\M]=0, 

E[Jj{9)Ji{9f\M] 

= E[si{9,)\M]Fn{9,)-\FM - E[sii9,)\M]}[l + Op{l)]. 

It follows from Assumption A4 that for in a neighborhood of 9,, F„(0) and 
Fni9*) - iii9) can be replaced by E[Fni9)\M] and E[Fni9,) - ii{9)\M], 
respectively, which completes the proof of Theorem 3(b). 

(c) Similarly to Theorem 3(b), we can prove Theorem 3(c). □ 

SUPPLEMENTARY MATERIAL 

Supplement to "Perturbation and scaled Cook's distance": 

(DOI: 10. 1214/12- AOS978SUPP; .pdf). We include two theoretical examples 
and additional results obtained from the Monte Carlo simulation studies and 
real data analysis. 
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